ABSTRACT

Storage devices based on flash memory have replaced hard
disk drives (HDDs) due to their superior performance, increasing
density, and lower power consumption. Unfortunately,
flash memory is subject to challenging idiosyncrasies
like erase-before-write and limited block lifetime. These
constraints are handled by a flash translation layer (FTL),
which performs out-of-place updates, wear-leveling and
garbage-collection behind the scenes, while offering the
application a virtualization of the physical address space.

A class of relevant FTLs employs a flash-resident page-associative
mapping table from logical to physical addresses,
with a smaller RAM-resident cache for frequently mapped
entries. In this paper, we address the problem of performing
garbage-collection under such FTLs. We observe two
problems. Firstly, maintaining the metadata needed to perform
garbage-collection under these schemes is problematic,
because at write-time we do not necessarily know the physical
address of the before-image. Secondly, the size of this
metadata must remain small, because it makes RAM unavailable
for caching frequently accessed entries. We propose
two complementary techniques, called Lazy Gecko and
Logarithmic Gecko, which address these issues. Lazy Gecko
works well when RAM is plentiful enough to store the GC
metadata. Logarithmic Gecko works well when RAM is scarce,
and efficiently stores the GC metadata in flash.
Thus, these techniques are applicable to a wide range of
flash devices with varying amounts of embedded RAM.

1. INTRODUCTION 

In recent years, usage of storage devices based on NAND
flash memory such as eMMCs (embedded multimedia cards)
and SSDs (solid state drives) has been increasing at an
exponential rate. The benefits of NAND flash include superior
performance relative to HDDs, shock-resistance, a gradually
increasing storage density, and lower power consumption.

However, flash memory is subject to challenging idiosyncrasies
[1]. In particular, flash is organized into erase-blocks.
Data can be written sequentially within an erase block, but
any update to this data, however small, must be preceded by
an erase operation, which is expensive and works at a block
granularity. Moreover, an erase-block becomes increasingly
prone to data errors as a function of the number of erases it
endures.

SSDs use a software layer called the flash translation layer
(FTL) to manage these characteristics. An FTL's main
job is to implement out-of-place updates to avoid having
to erase and rewrite an entire block for every small data update.
The FTL does this by providing a mapping scheme
from logical to physical addresses, and a garbage-collection
mechanism to reclaim invalid space. The FTL also performs
wear-leveling to ensure blocks across the device wear out at
the same rate. The FTL typically stores metadata in a RAM^1
module embedded within the flash device.

The simplest implementation of a mapping scheme is a 
RAM-resident array that maps logical to physical addresses 
[mis]. The problem is that the size of this table is often 
too large to fit into the RAM typically embedded in a flash 
device. Indeed, for consumer SSDs and portable electronics, 
the cost and size of a flash device is a central priority m- 
Manufacturers typically strive to reduce this cost by providing
as little RAM as possible, especially since the cost per
byte of flash memory is scaling faster than the cost per byte
for SRAM [14][6].

Over the past decade, numerous RAM-efficient mapping
schemes have been proposed, as captured by a recent survey
M- Flash-resident page-associative schemes, particularly
DFTL and LazyFTL, have been acknowledged as the
most efficient among the schemes [a Ham]. Such schemes
store a page-associative mapping table in flash, and cache
frequently accessed entries in the available RAM.

However, the design of such schemes as described in the
literature is incomplete, and the missing piece is
garbage-collection. The two challenges we identify and tackle in
this paper concern the metadata needed by the
garbage-collection module to select a victim block and to infer which
pages on the victim block are still valid. We show that
maintaining this metadata is a challenge because at write-time,
we do not necessarily know the physical address of
the before-image. We also identify the size of the metadata
required for garbage-collection as an issue for flash devices
that have little RAM.

We propose two complementary schemes that address these
problems. The first is called Lazy Gecko, standing for Lazy
Garbage-Collector. This scheme uses a flash-resident reverse
map and a RAM-resident bitmap to enable victim-selection
and live-page-identification. This scheme is feasible
for SSDs with a moderate amount of embedded RAM.
The second scheme, called Logarithmic Gecko, is designed
for flash devices with very little RAM, such as eMMCs.
Logarithmic Gecko is similar to Lazy Gecko, but it stores most
of its metadata on flash. This requires a moderate number
of internal IOs to maintain and access, but it leaves a
significantly lower RAM footprint.

^1 SRAM is often used rather than DRAM due to its high
performance and low power consumption [14].

Our contributions in this paper are as follows:

• We introduce the problem of maintaining the metadata
needed to enable garbage-collection on flash-resident
page-mapping FTLs.

• We propose two complementary techniques for solving
the garbage-collection metadata problem. These techniques
are suitable for a wide range of flash devices
with varying amounts of RAM.

• We evaluate these techniques using simulation and
demonstrate their impact on write-amplification and
read-amplification.

2. RELATED WORK 

2.1 Flash Devices 

Flash devices store data in NAND chips, each of which
is organized into independent arrays of memory cells. Each
cell accommodates 1, 2 or 3 bits (SLC, MLC or TLC). An
array is a flash block, and a row within the array is a flash
page. A flash page can typically store 4-16 KB and a block
typically contains 128-512 pages.

Flash devices are subject to challenging idiosyncrasies. An
erase operation must precede an update to a page, and an
erase has the granularity of a block. Moreover, blocks have
a limited lifetime in terms of erases [1].

Unfortunately, as the density of flash devices is increasing,
their reliability and the lifetime of blocks are decreasing.
Bit-shifts may occur both when writing and reading a page
due to electrical side-effects. To mitigate this risk, two
additional constraints on writes are generally applied for MLC
and TLC devices. Writes have a minimal granularity of a
flash page, and writes must take place sequentially within a 
block 0. 

Each flash page is given an adjacent spare out-of-band
(OOB) area for storing metadata about the page. It is typically
smaller than the page itself by a factor of 32 and contains
the logical address of the page as well as error-correction
codes.

2.2 Log-Based Hybrid FTL Schemes 

In log-based hybrid schemes [14], the blocks in the SSD are
divided into two types: data blocks and log blocks. Logical
pages with adjacent addresses are stored in the order of their
addresses on data blocks. There is a mapping table in RAM
from logical blocks to physical blocks. A page can be looked
up by determining its block from the mapping and adding
its modulo offset within the block. The remaining log blocks
are used to buffer updates. A log block is page-associative,
meaning that page updates can be stored on it in any order.
A page-mapping for all log blocks is stored in RAM.
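
The data-block lookup described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `PAGES_PER_BLOCK`, `block_map` and `lookup_data_page` are hypothetical, the table contents are made up, and log blocks are ignored.

```python
PAGES_PER_BLOCK = 128  # B; an illustrative value, not tied to a specific device

# Hypothetical RAM-resident table: logical block number -> physical block number.
block_map = {0: 7, 1: 3, 2: 9}


def lookup_data_page(logical_page):
    """Return (physical block, page offset) for a logical page on a data block."""
    logical_block = logical_page // PAGES_PER_BLOCK
    offset = logical_page % PAGES_PER_BLOCK  # the "modulo offset" within the block
    return block_map[logical_block], offset
```

Because pages keep their address order inside a data block, one small block-level table is enough to locate any page that has not been updated since the last merge.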


When space for incoming updates becomes limited, a log
block is selected and merged with the data blocks that have
updates on it. The more data blocks that have a page update
on a log block, the more the cost of the merge increases,
as more data blocks must be erased and rewritten. In the
Set-Associative Sector Translation (SAST) scheme [16], log
blocks are set-associative. This means that the logical address
space is divided into equally sized sets, and a given
log block can only accept updates from pages that belong
in a given set. This limits the cost of a merge by restricting
the number of data blocks that can be associated with a log
block. SAST is used in practice by the very recent F2FS
[H] file system for eMMC devices. Other log-based hybrid
schemes of interest are BAST [^ and FAST [12], whereby
log blocks are block-associative and fully-associative
respectively.

2.3 Flash-resident page-associative schemes 

In flash-resident page-associative schemes, the mapping
table is page-associative, meaning that a logical page can
be written on any physical page. Since this requires a large
mapping table (typically 4 bytes per flash page), this table
is stored and maintained in flash. Frequently accessed
mapping entries are cached in RAM [14].

For example, in DFTL [5] a mapping page stores mapping
entries for adjacent logical addresses. The mapping in flash
is updated lazily. This means that when a page is updated,
the updated mapping entry will first only exist in the
RAM-resident mapping cache. Such an entry is labelled dirty.
The cache uses an LRU page-replacement strategy to evict
entries as it runs out of space. If a dirty entry is evicted, its
corresponding mapping page in flash must be read, updated
and rewritten.
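
The lazy update policy can be illustrated with a small Python sketch. It is not DFTL's actual implementation: the class `MappingCache`, its fields, and the use of a plain dictionary in place of flash-resident mapping pages are all assumptions made for illustration.

```python
from collections import OrderedDict


class MappingCache:
    """Sketch of an LRU cache of logical-to-physical mapping entries."""

    def __init__(self, capacity, flash_mapping):
        self.capacity = capacity
        self.flash_mapping = flash_mapping  # stands in for the flash-resident table
        self.entries = OrderedDict()        # la -> (pa, dirty), least recent first
        self.mapping_page_writes = 0        # counts read-update-rewrite cycles

    def update(self, la, pa):
        """Record a new physical address; the flash copy is updated lazily."""
        self.entries[la] = (pa, True)
        self.entries.move_to_end(la)
        self._evict_if_needed()

    def _evict_if_needed(self):
        while len(self.entries) > self.capacity:
            la, (pa, dirty) = self.entries.popitem(last=False)
            if dirty:
                # A real FTL would read, update and rewrite a mapping page here.
                self.flash_mapping[la] = pa
                self.mapping_page_writes += 1
```

With a capacity of two entries, three updates to distinct logical addresses force exactly one dirty eviction, which is the point at which the flash-resident mapping catches up.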

LazyFTL m is an interesting variation of DFTL that
separates hot and cold data and strives to provide better
consistency guarantees to avoid losing cached addresses in
the event of power failure.

Flash-resident page-associative schemes tend to involve 
significantly lower write-amplification than log-based hybrid 
schemes. The reason is that there is complete flexibility in 
where we can store a page. Thus, we avoid the potentially 
expensive merge operations that are inherent in log-based 
hybrid schemes. Instead, we can simply pick a block with 
few live pages, migrate these pages into other blocks with 
free space, and erase the block. A lower write-amplification 
implies better performance, device longevity and reliability. 

The challenge with page-associative schemes, however, is
that they rely on high temporal locality in the data. The
larger the working set in the workload is, the more cache
misses and evictions occur, which leads to an increase in
read-amplification and write-amplification. It is therefore
desirable to allocate as much of the available RAM as possible
to the cache for frequently accessed entries.

2.4 Shortcomings of Existing Schemes 

Although flash-resident page-associative mapping schemes
are the state-of-the-art in terms of performance and reliability,
existing schemes have problems. In particular, DFTL
does not address the problem of how to maintain metadata
to determine which pages are valid and should be migrated
as a part of a garbage-collection operation.

LazyFTL does address this problem, but its design relies
on an obsolete assumption that the OOB area of a page
can store a bit that indicates if the page is valid or invalid.
This assumption dates back to 1995 in the design
of a Flash-File System [S], whose design assumed that a
part of the OOB could be left un-programmed initially, and
programmed later to indicate when the page has become
invalid. Unfortunately, as we saw in Section 2.1, a new constraint
recently emerged that pages should be programmed
sequentially within a block to minimize electrical side effects.
The design of LazyFTL violates this constraint.

The problem of maintaining metadata for garbage-collection
has not been addressed adequately in any work we are aware
of. It is concerning that the recent F2FS file system uses the
SAST FTL as opposed to state-of-the-art page-associative
schemes like DFTL and LazyFTL. The reason may be that
page-associative schemes have yet to reach maturity.
We hope that this paper helps fill this gap.

3. PROBLEM DEFINITION 

In the context of flash-resident page-associative FTLs,
any garbage-collection scheme must answer two questions:
which block to reclaim next, and which pages are still valid
on this victim block? As we will now see, those questions are
straightforward to answer when RAM is abundant and the
mapping table can fit into RAM. As we will later see, those
questions are much more difficult to answer when RAM is
scarce and the mapping table is in flash. Table 1 lists terms
we use throughout the paper. We refer to the logical to
physical mapping table as page-mapping.

3.1 Abundant RAM 

Assuming RAM is plentiful enough to store the entire 
page-mapping, let us survey a few techniques for performing 
victim-selection and live-page-identification. 

3.1.1 Victim Selection 

Greedy: A simple method for victim selection is to maintain
a mapping from block id to the number of valid pages
in each block. To select a GC victim, we scan this mapping
and choose the block with the least number of live pages. To
maintain this mapping, we must know the physical address
of the before-image of every update. We use the address of
the before-image to decrement the counter for the block in
which the before-image resides.

LRU: A different technique for victim-selection is least
recently used (LRU), which selects the block that was erased
the longest time ago. The rationale is that this block is likely
to contain the least number of live pages. Typically, this
requires maintaining a queue of blocks. A block is inserted
into the queue when it is written, and a victim is selected by
popping the queue. The drawback of the LRU scheme is that it
may involve more migrations than the greedy scheme, since
we have no guarantee that we have actually chosen the block with
the least number of live pages.

Window-greedy: A compromise between the LRU and
greedy policies is window-greedy. It implements a block
queue like the LRU policy and applies the greedy policy
only to the front X blocks in the queue. This avoids
the potentially CPU-expensive scan of the greedy algorithm,
and increases the chance of finding a block with few live
pages relative to the LRU policy.
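
The three policies can be contrasted with a short Python sketch. It is a simplified illustration: the function names are hypothetical, `live_pages` is an assumed dictionary from block id to its live-page counter, and `queue` is an assumed `deque` of block ids ordered from oldest to newest.

```python
from collections import deque


def select_greedy(live_pages):
    """Greedy: scan all counters and pick the block with the fewest live pages."""
    return min(live_pages, key=live_pages.get)


def select_lru(queue):
    """LRU: pop the block at the front of the queue (the oldest one)."""
    return queue.popleft()


def select_window_greedy(queue, live_pages, window):
    """Window-greedy: apply the greedy rule to the front `window` blocks only."""
    candidates = list(queue)[:window]
    victim = min(candidates, key=live_pages.get)
    queue.remove(victim)
    return victim
```

Greedy inspects every counter, LRU inspects none, and window-greedy inspects at most `window` of them, which is where the CPU saving relative to greedy comes from.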

Note that some methods also choose a victim based on
age i. Such methods essentially integrate the wear-leveling
and garbage-collection schemes. In this work we just concentrate
on garbage-collection. Some works also separate pages
into groups based on update frequency and perform
garbage-collection independently within each group [18]. This is also
outside of the current scope.

3.1.2 Live Page Identification 

Once a victim has been chosen, we need to check which 
pages in it are still valid. Three techniques are possible. 

page-mapping scan: We can scan the entire page-mapping 
to find all live pages that are on the target block. However, 
this scan may become a CPU bottleneck. 

page-validity-bitmap (PVB): A less CPU-intensive alternative
is to use a page-validity-bitmap, which tracks which
pages in the SSD are valid and which are invalid. Pages are
clustered based on which block they are on. To maintain this
map, we must know the physical location of the before-image
of each write to flip the corresponding bit in the bitmap.
Note that if we use the greedy policy for victim selection,
the PVB can be used to keep track of the number of live
pages in a block by taking the Hamming weight of the bits
associated with a given block.

flash-reverse-mapping: Yet another alternative is to
store a flash-reverse-mapping in the out-of-band component
of each block. This mapping indicates which logical
pages are written in each of the physical pages on the block.
It is updated when the block is written. In order to identify
which pages are valid, we read this map before starting
a GC operation. We look up each logical address in the
page-mapping table and check if the physical address still
corresponds to the block we are targeting. If so, then the
page is valid.
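
The bitmap and reverse-mapping techniques can be sketched as follows, assuming page-mapping is RAM-resident. The names and data layouts are hypothetical: `pvb` is a flat list of bits in which 1 means "valid" (the opposite polarity to the one Lazy Gecko adopts in Section 5.1), `reverse_map` maps a block id to the logical addresses written on it in page order, and `page_mapping` maps a logical address to a (block, offset) pair.

```python
B = 4  # pages per block; a tiny illustrative value


def live_pages_from_pvb(pvb, block):
    """Count valid pages of a block; in this sketch a bit of 1 means 'valid'."""
    bits = pvb[block * B:(block + 1) * B]
    return sum(bits)  # Hamming weight of the block's slice of the bitmap


def live_pages_from_reverse_map(reverse_map, page_mapping, block):
    """Return offsets of pages on `block` that page-mapping still points to."""
    live = []
    for offset, la in enumerate(reverse_map[block]):
        if page_mapping.get(la) == (block, offset):
            live.append(offset)
    return live
```

The second function performs one page-mapping lookup per page on the block, which is cheap only as long as page-mapping resides in RAM.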

Note that the above three techniques assume that
page-mapping is in RAM. In the next section, we examine the
problems that arise when most of page-mapping is stored in
flash.

3.2 Scarce RAM 

When RAM is scarce and most of page-mapping is stored
in flash, challenges arise. The essential problem in all cases
is that when a write arrives, we may not know the physical
location of its before-image. Thus, we cannot keep the
metadata up-to-date. Let us re-examine the policies from
the last subsection in this context.

3.2.1 Victim Selection 

Greedy and window-greedy: In order to keep track of
the number of live pages per block, we must know the physical
location of the before-image to decrement the appropriate
block counter. However, if the page's mapping entry is
not cached, the address of the before-image is unavailable,
and we don't know which counter to decrement. We can look
it up in page-mapping in flash using a read IO, but doing so
for each write can severely increase read-amplification.

LRU: The LRU policy is unaffected by moving
page-mapping to flash as it does not rely on accessing it. We
leverage this later.

3.2.2 Live Page Identification 

page-mapping scan: Scanning page-mapping to determine
which pages are still valid in each target block would
cripple performance, since page-mapping is in flash and
comprises thousands of flash pages.


Table 1: Terms

Term  Description                                  Micron P420m  Intel 525 series
K     Number of blocks in the SSD                  ~2^18
B     Number of pages per block                    512           128
P     Size of a page                               16 KB         4 KB
OP    Over-provisioning, measured as 1 minus the   30%           7%
      size of the logical address space over the
      size of the physical address space
PBA   K · B (number of physical pages)
LBA   K · B · (1 - OP) (number of logical pages)
a     Page or block address. Assume 4 bytes.
R     Size ratio between adjacent levels in
      Logarithmic Gecko's LSM-tree
L     Maximum number of levels in the LSM-tree


page bitmap: Maintaining a RAM-resident bitmap that
indicates which pages are live is problematic. The reason is
that when we update a page, we don't know the physical
address of the before-image if it is not already cached in the
page-mapping cache. Thus, we cannot flip the appropriate
bit in the bitmap to indicate the page is now invalid.
We can access page-mapping to find the before-image, but
doing so for each write is going to significantly increase
read-amplification.

flash-reverse-mapping: This policy, which involves reading
a reverse-mapping from the OOB part of the target
block, is also impractical. The reason is that for each entry
in the reverse mapping, we need to access page-mapping
to determine whether the logical page is still on the same
block. Typically, most of these addresses will not be cached,
and so page-mapping may need to be accessed up to B
times for each GC operation. This greatly increases
read-amplification.

3.3 Problem Summary 

The Bookkeeping Maintenance Problem: The metadata
needed for victim-selection and live-page-identification
must be maintained for every page update using
information about the physical address of the before-image.
However, the mapping entry with the before-image may not
be cached. The naive solution is to access page-mapping using
read IOs to find the before-image's physical address, but
doing so too often may lead to unacceptable read-amplification.

The Scarce RAM Problem: Another problem is that
this metadata may consume a substantial amount of RAM.
The RAM consumed by this metadata becomes unavailable
for caching frequently accessed entries, which degrades
performance. It also limits the ability of SSDs to scale, as RAM
is more expensive than flash [14].

In the next two sections, we propose two novel garbage- 
collection algorithms, namely Lazy Gecko and Logarithmic 
Gecko. The former only addresses problem 1. The latter is 
an extension of the former that also addresses problem 2. 

4. SYSTEM MODEL 

Before describing the two schemes, let us outline some
assumptions about the underlying system. We consider an
SSD whose architecture is captured by the terms in Table 1.
We assume that this SSD does not have sufficient RAM
for storing the entire page-mapping, and that a
flash-resident page-associative FTL is used to store page-mapping
in flash. For concreteness and for the experimental evaluation
later, we assume the mapping scheme is DFTL as described
in [5]. However, the techniques introduced in this
paper are in principle also applicable to LazyFTL, or any
other flash-resident page-associative scheme.

DFTL stores the mapping table in translation pages in 
flash. These translation pages occupy separate blocks from 
user data. The reason for this is that translation pages are 
updated much more frequently, and it is considered beneficial
for performance to separate pages of different update
frequencies on different flash blocks [m a [TO]. There is 
a pool of free blocks, and one active translation block and 
one active data block on which translation pages and data 
pages are written respectively. When an active block of either
group runs out of space, a new one is requested from
the pool of free blocks. When the pool of free blocks is 
nearly empty, the garbage-collection mechanism is invoked. 

A RAM-based table called the Global Mapping Directory
(GMD) stores the locations of translation pages in flash.
Finally, DFTL stores frequently accessed mapping entries
in a RAM-based table called the Cached Mapping Table
(CMT). We refer to an entry in the CMT as "dirty" if it is
not synchronized with the corresponding mapping entry in
flash.

5. LAZY GECKO 

Here we introduce Lazy Gecko, which stands for Lazy 
Bookkeeping Garbage-Collector. Lazy Gecko combines a 
few of the techniques described in the problem definition to 
solve the bookkeeping maintenance problem. 

5.1 Data Structures 

In terms of data structures, Lazy Gecko uses a
RAM-resident bitmap called the Page Validity Bitmap (PVB).
There is one bit for each page in the system. A bit is set to
1 if the page is invalid, and 0 otherwise. It is updated using
the following trivial Algorithm 1, which we override later in
the design of Logarithmic Gecko.

Algorithm: invalidate()
Input: physical_address pa
    PVB[pa] = 1

Algorithm 1: Marks a physical page as invalid.

Lazy Gecko also maintains a flash-resident Reverse-Map, a
mapping from physical pages to the logical pages that were
last written on them. The reverse-map is stored on flash
pages rather than in the OOB component of flash pages. This
map occupies a small percentage of flash (approximately
0.02% in the devices in Table 1). It is arranged such that
the mapping entries of all physical pages from the same flash
block are stored on the same flash translation page. A
RAM-resident Global-Reverse-Mapping-Directory is used to keep
track of the whereabouts of the flash pages in this map.
Translation pages belonging to the reverse map are stored
on separate flash blocks, which we call Reverse Blocks. The
reverse map is updated as follows. Whenever a flash block
containing user data is written, its corresponding reverse
translation page is read, updated to reflect the new logical
addresses written to the block, and written to flash on a
Reverse Block with free space. Maintaining this map entails
a modest overhead of one read and one write IO per
garbage-collection operation.

Finally, we add one bit to each entry in the cached mapping
table called the "synch flag". This flag is set to 0 if
a logical address in the cache has some before-image whose
physical address has still not been set to invalid in the PVB,
and 1 otherwise.

5.2 Operations 

The crux of Lazy Gecko is how to maintain the PVB up- 
to-date in order to allow performing victim-selection and 
live-page-identification. This is done by adding some logic 
to the following operations. 

5.2.1 Application Write 

When an application page update arrives, we follow
Algorithm 2. This algorithm checks if the current mapping entry
for the logical address is in the CMT. If not, we insert it into
the CMT, and the synchronization flag is set to false. If it is
cached, we invoke Algorithm 1 to invalidate the page's former
physical address. However, we do not set the synchronization
flag to true, as there may be another physical page on
which the logical page once resided that is still marked as
valid in the PVB. Lastly, we execute the write and update
the cached entry with the new physical address of the logical
page.

Note that this procedure creates false-positives in the PVB. 
In other words, there may be pages in the bitmap marked 
as valid that are actually invalid. We show how to resolve 
these false positives later. 

Algorithm: Handle Write
Input: page_write
    la = page_write.logical_address;
    if cache.contains(la) then
        pa = cache[la].physical_address;
        invalidate(pa);
    else
        cache.insert(la);
        cache[la].synch_flag = false;
    end
    cache[la].physical_address = ssd.write(page_write);
    cache[la].dirty = true;

Algorithm 2: Handles an application page write.
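
For readers who prefer executable code, the write path can be sketched in Python as below. This is a minimal illustration and not the authors' implementation: the `Entry` class, the `ssd_write` callable and the use of plain integers as physical addresses indexing a list-based PVB are all assumptions, and cache eviction is left out.

```python
class Entry:
    """A cached mapping entry (hypothetical layout)."""

    def __init__(self, pa=None, synch_flag=True):
        self.pa = pa
        self.synch_flag = synch_flag
        self.dirty = False


def handle_write(la, cache, pvb, ssd_write):
    """Handle an application write to logical address `la`.

    cache: dict la -> Entry; pvb: list of bits (1 = invalid);
    ssd_write: callable that performs the write and returns the new address.
    """
    if la in cache:
        pvb[cache[la].pa] = 1                # before-image is known: invalidate it
    else:
        cache[la] = Entry(synch_flag=False)  # before-image unknown for now
    cache[la].pa = ssd_write(la)
    cache[la].dirty = True
```

Note that the first write to an uncached address leaves the PVB untouched. This is the source of the false positives that the later operations resolve.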


5.2.2 Application Read


We handle an application read as follows. If the mapping
entry is cached, we simply execute the read. If it
is not cached, we issue a read IO to the appropriate flash
translation page. We then insert the mapping entry into the
CMT with the synchronization flag set to true. We omit an
algorithm listing for this operation due to space constraints.

5.2.3 Translation Page Read 

We resolve false positives in the PVB by piggybacking some
logic onto routine operations that take place in the background
all the time, namely (1) cache misses, (2) cache evictions,
and (3) GC migrations targeting translation pages.
Whenever one of these operations takes place, we invoke
Algorithm 3, which iterates through all the mapping entries in
the translation page. If any of them is in the CMT with the
synchronization flag marked as false, it means that the physical
page on which the logical page was at some point written has
not been marked as invalid in the PVB. We correct this by
setting the corresponding bit in the PVB to 1. We also set the
synch flag for the cached entry to true, because at this point
there can be no other before-images for the logical address
that are still unsynchronized. The reason is that Algorithm 3
is always called as a result of a page eviction, so a page is
always synchronized when it is evicted.

Algorithm: Lazy Updates
Input: mapping_page
    forall the entries in mapping_page do
        la = entry.logi_addr;
        pa = entry.phys_addr;
        if cache.contains(la) and !cache[la].synch_flag then
            invalidate(pa);
            cache[la].synch_flag = true;
        end
    end

Algorithm 3: Detects and resolves false positives.

Note that Algorithm 3 cannot be a bottleneck, as the
number of entries in a translation page is typically between
2^10 and 2^12. Iterating through an array of this size takes
hundreds of nanoseconds, whereas the cost of flash operations
is on the order of tens to hundreds of microseconds.
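
A Python sketch of the lazy update step follows. The data layouts are assumptions made for illustration: a translation page is modelled as a dictionary from logical address to the physical address stored in flash, cached entries are small dictionaries, and the PVB is a list in which 1 means invalid. The helper `entries_per_translation_page` additionally assumes that each entry occupies exactly the 4-byte address size from Table 1.

```python
ADDRESS_BYTES = 4  # size of one address, as assumed in Table 1


def entries_per_translation_page(page_bytes):
    """Entries per translation page, assuming one 4-byte address per entry."""
    return page_bytes // ADDRESS_BYTES


def lazy_updates(mapping_page, cache, pvb):
    """Resolve false positives using a translation page that was just read.

    mapping_page: dict la -> pa as stored in flash (a before-image);
    cache: dict la -> {"pa": int, "synch_flag": bool}; pvb: list, 1 = invalid.
    """
    for la, flash_pa in mapping_page.items():
        entry = cache.get(la)
        if entry is not None and not entry["synch_flag"]:
            pvb[flash_pa] = 1           # the flash copy points at a stale page
            entry["synch_flag"] = True
```

Under these assumptions a 4 KB translation page holds 1024 entries and a 16 KB page holds 4096, so the loop is short compared with a single flash IO.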

5.2.4 Garbage-Collection 

In terms of victim-selection, Lazy Gecko is compatible
with the Greedy and Window-Greedy policies, as the PVB can be
scanned to select the block with the fewest live pages. Since
a bit of 1 marks an invalid page, the Hamming weight
(number of non-zero bits) of a block's bitmap gives its number
of invalid pages, and the number of live pages is B minus
this weight.

Live-page-identification works by referring to the PVB. However,
when a block is chosen for garbage-collection, false
positives may still exist in the PVB. To resolve these, we
use the following key insight: if a physical page marked as
valid in the PVB is actually invalid, a dirty mapping entry for the
logical page which was last written on it must still be in the
Cached Mapping Table. The reason for this is that when
a dirty mapping entry is evicted from the cached mapping
table, the PVB is updated. Thus, if the PVB is not up-to-date,
it means a page eviction never took place, and the logical
entry with its current physical page address is cached.

We exploit this insight as follows in Algorithm 4, which
resolves any remaining false positives. We read the reverse
translation page from Reverse Mapping corresponding to
the victim block, and iterate through the logical addresses
that were last written to this block. If any of them is in
the cached mapping table with the synchronization flag set
to false, we know that the physical page in the victim is in
fact invalid. This is all we need to complete the live-page
identification.

Algorithm: Garbage Collection
Input: victim block b
    original_block_mapping = ssd.read_reverse_mapping(b)
    forall the entries in original_block_mapping do
        la = entry.logical_address;
        if cache.contains(la) and !cache[la].synch_flag then
            pa = entry.physical_address;
            invalidate(pa)
        end
    end

Algorithm 4:
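
The following Python sketch illustrates this step under stated assumptions. The function name and data layouts are hypothetical: the reverse translation page is modelled as a list of logical addresses indexed by page offset, and physical addresses are integers of the form block · B + offset. The sketch also adds one guard that the listing does not spell out, namely that the page a cached entry currently points to is never invalidated. This guard is an assumption of the sketch, not a statement about the original design.

```python
def live_pages_in_victim(victim, reverse_entries, cache, pvb, pages_per_block):
    """Return offsets of live pages on `victim` after resolving false positives.

    reverse_entries: logical addresses last written to the victim, by offset;
    cache: dict la -> {"pa": int, "synch_flag": bool}; pvb: list, 1 = invalid.
    """
    base = victim * pages_per_block
    for offset, la in enumerate(reverse_entries):
        pa = base + offset
        entry = cache.get(la)
        # Assumed guard: skip the page that the cached entry currently points to.
        if entry is not None and not entry["synch_flag"] and entry["pa"] != pa:
            pvb[pa] = 1
    return [off for off in range(pages_per_block) if pvb[base + off] == 0]
```

The returned offsets are the pages that would have to be migrated before the victim block is erased.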

When a block is erased, we reset the bits corresponding
to its physical pages in the PVB. We also update Reverse
Mapping, as described in Section 5.1.

5.3 Reflection 

We showed that in order to resolve the garbage-collection
metadata maintenance problem, Lazy Gecko maintains a
page-validity-bitmap (PVB) in RAM and stores a reverse
map in flash. The only IO overhead introduced is 1 flash
read and 1 flash write to update the reverse-mapping for
each garbage-collection operation. This overhead is relatively
modest.

The main problem of Lazy Gecko is that the amount of
RAM needed is proportional to the number of pages in the
SSD, as 1 bit is needed for each page. This may be an
issue for SSDs with little RAM. For instance, an SSD of the
same dimensions as the Micron P420m in Table 1 requires
16 megabytes for the bitmap. This is a hard limit. An SSD
with less RAM than this cannot use Lazy Gecko, and can
therefore not use a flash-resident page-associative scheme.
And even if the SSD has enough RAM to store the PVB, the
RAM consumed by the PVB is unavailable for storing frequently
accessed entries in the CMT, thereby degrading performance.
These problems are addressed in the next section.
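
The 16-megabyte figure can be checked with a short calculation. The sketch below assumes a block count of K = 2^18, a value chosen because it is consistent with the stated figure, together with B = 512 pages per block from the Micron P420m column of Table 1. The function name `pvb_bytes` is hypothetical.

```python
def pvb_bytes(num_blocks, pages_per_block):
    """RAM needed by a bitmap with one bit per physical page."""
    return num_blocks * pages_per_block // 8


K = 2 ** 18  # assumed block count, consistent with the 16 MB figure
B = 512      # pages per block (Micron P420m column of Table 1)

print(pvb_bytes(K, B) / 2 ** 20, "MiB")
```

Under these assumptions the bitmap holds 2^27 bits, which is 2^24 bytes.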

6. LOGARITHMIC GECKO 

We now introduce Logarithmic Gecko, which stands for
Logarithmic Garbage-Collector. It is very similar to Lazy
Gecko, the main difference being that in Logarithmic Gecko
the Page Validity Bitmap is stored in flash to save RAM. This
reduces the RAM footprint by as much as 97% relative to
Lazy Gecko for flash devices that are on the market today.
Logarithmic Gecko requires a modest number of IOs
to maintain the flash-resident PVB.

Logarithmic Gecko uses a Reverse Map and a Page Validity
Bitmap (PVB) similarly to Lazy Gecko. It is identical
to Lazy Gecko in terms of how the PVB is updated lazily, and
how we resolve false positives using the Reverse Map. The
difference is that Logarithmic Gecko stores the PVB in flash as
an LSM-tree m- As an overview, the PVB is structured as
a series of "sorted runs" of exponentially increasing sizes in
flash. Each sorted run contains a sorted mapping from block
ids to bitmaps. A bitmap has B bits, one for each physical
page in the block, which indicate whether they are valid or
not. The first sorted run is in a RAM-based buffer, and updates
are made to it as discussed in Section 6.1. When this
buffer fills up, it is flushed to flash, and then a merge procedure
may commence which merges two or more sorted runs,
as discussed in Section 6.2. To perform live-page-identification for
a given block, we search for its id in all the sorted runs and
apply the bitwise OR operator to all of its bitmaps,
as described in a later subsection. Victim-selection is discussed in
Section 6.4, and further possible optimisations are discussed
later in this section.

We emphasize that Lazy and Logarithmic Gecko are iden¬ 
tical in terms of the logic of live-page-identification. The 
crux of this section is about how to keep PVB in flash. 

6.1 Buffer Management 

Logarithmic Gecko has one buffer in RAM the size of one 
flash page. This buffer contains a sorted mapping from block 
ids to Gecko entries. A Gecko entry consists of two fields: 
(1) a bitmap of size B, where the bit at offset i indicates 
whether the physical page at offset i in the block is invalid, 
and (2) one additional bit called the "erase flag", which is 
used to indicate that a block has been erased. The erase 
flag is used for merging (see next subsection). 

Logarithmic Gecko handles application reads and writes 
just as Lazy Gecko does. However, the invalidate procedure 
from algorithm [T] is replaced, since PVB is no longer a 
simple RAM-based bitmap. Instead, Algorithm 5 is invoked. 
This algorithm is given the physical address of a page that is 
no longer valid. It first checks whether an entry for the block 
id of the before-image is in the buffer. If not, it adds an 
entry with the block id as the key and a blank Gecko entry 
(the bits in the bitmap and the erase flag are all set to 0). 
The bit in the bitmap that corresponds to the page that has 
been invalidated is then set to 1. If the buffer is now full, 
it is flushed. 

Algorithm: invalidate() 
Input: physical_address pa 

    block_id = pa.block_id; 
    page_offset = pa.page_offset; 
    if !buffer.contains(block_id) then 
        buffer.insert(block_id); 
        buffer[block_id].bitmap = blank bitmap; 
        buffer[block_id].erase_flag = false; 
    end 
    buffer[block_id].bitmap[page_offset] = 1; 
    if buffer is full then 
        flush(buffer); 
    end 

Algorithm 5: invalidate() 
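
The buffer logic of Algorithm 5 can be sketched in Python as follows. This is a minimal illustration rather than the actual implementation: the class names, the tiny capacity, and the in-memory list standing in for flash are all assumptions made for the example.

```python
from dataclasses import dataclass, field

BUFFER_CAPACITY = 2  # entries per flash page; assumed tiny value for illustration


@dataclass
class GeckoEntry:
    bitmap: int = 0           # bit i set => page i of the block is invalid
    erase_flag: bool = False  # set when the block has been erased


@dataclass
class GeckoBuffer:
    entries: dict = field(default_factory=dict)  # block_id -> GeckoEntry
    flushed: list = field(default_factory=list)  # stand-in for flash-resident runs

    def invalidate(self, block_id: int, page_offset: int) -> None:
        # Create a blank entry if the block is not in the buffer yet.
        entry = self.entries.setdefault(block_id, GeckoEntry())
        entry.bitmap |= 1 << page_offset
        if len(self.entries) >= BUFFER_CAPACITY:
            self.flush()

    def flush(self) -> None:
        # A flushed page is a run sorted by block id; the buffer is then cleared.
        self.flushed.append(sorted(self.entries.items(), key=lambda kv: kv[0]))
        self.entries = {}


buf = GeckoBuffer()
buf.invalidate(7, 1)
buf.invalidate(7, 3)   # same block: only the bitmap changes
buf.invalidate(2, 0)   # second distinct block fills the buffer and triggers a flush
print(buf.flushed)
```

Note that repeated invalidations within the same block share one entry, so the buffer fills at the rate of distinct blocks touched rather than pages invalidated.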

When the buffer fills up, its contents are flushed to 
flash and it is cleared so that new entries can be written to it. 
The flush may trigger a process that merges two or more of 
the flash-resident sorted runs. This is described in the next 
section. Note that all flash pages that comprise the LSM-tree 
are allocated on separate flash blocks that we refer to 
as Gecko blocks. 

Resolving false positives in PVB is done using algorithm 
[3] in Section 5.2.3. The only difference is that within that 
algorithm the new version of the invalidate method is invoked 
instead of algorithm [T]. 

When a regular data block is erased and written with new 
entries, the procedure in Algorithm 6 is invoked. If there is 
already an entry corresponding to the block in the buffer, 
its erase flag is set to 1. Otherwise, an entry is inserted with 
a blank bitmap and the erase flag set to 1. 

Algorithm: Handle_Block_Rewritten 
Input: block_id 

    if !buffer.contains(block_id) then 
        buffer.insert(block_id); 
        buffer[block_id].bitmap = blank bitmap; 
    end 
    buffer[block_id].erase_flag = true; 

Algorithm 6: Handle_Block_Rewritten 
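
For concreteness, the erase handling of Algorithm 6 might look as follows in Python. The dictionary-based buffer and the `GeckoEntry` type are assumptions made for this sketch, which follows the description in the text literally.

```python
from dataclasses import dataclass


@dataclass
class GeckoEntry:
    bitmap: int = 0           # bit i set => page i of the block is invalid
    erase_flag: bool = False  # set when the block has been erased


def handle_block_rewritten(buffer: dict, block_id: int) -> None:
    """Mark a block as erased, inserting a blank entry first if none exists."""
    entry = buffer.setdefault(block_id, GeckoEntry())
    entry.erase_flag = True


buffer = {4: GeckoEntry(bitmap=0b10)}
handle_block_rewritten(buffer, 4)  # existing entry: only the flag changes
handle_block_rewritten(buffer, 9)  # missing entry: blank bitmap, flag set
print(buffer)
```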

6.2 Merging 

In the last subsection, we saw that when the buffer fills 
up, it is flushed to a Gecko block on flash. We call the flushed 
page a "sorted run" of size 1, and we say that it is at level 
1 of the LSM-tree. (We consider the RAM-based buffer to 
be at level 0.) 

Like any LSM-tree, the Logarithmic Gecko tree contains 
multiple levels, and we denote the n-th level as $L_n$. There is 
typically either 0 or 1 sorted run per level. The LSM-tree 
has a tuning parameter T, which determines the size ratio of 
sorted runs in any two adjacent levels of the tree. A sorted 
run at level i contains between $T^{i-1}$ and $T^{i} - 1$ flash pages. 
We discuss the impact of the parameter T on performance 
in Section [T] 

If there is more than one sorted run at level i, a merge 
procedure is triggered. This procedure allocates two input 
buffers and one output buffer in RAM. It stores the resulting 
run in level i or i + 1 depending on how many pages it has, 
and disposes of the original runs. Thus, a merge may be 
invoked after a flush, and it may continue recursively based 
on the state of the tree. 
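
To make the level sizing concrete, the following small sketch computes the level to which a run of a given size belongs, based on the rule that a run at level i holds between T^(i-1) and T^i - 1 flash pages. The function name is ours and the values of T are chosen only for illustration.

```python
def level_of(num_pages: int, T: int) -> int:
    """Return the level i such that T**(i-1) <= num_pages <= T**i - 1."""
    if num_pages < 1 or T < 2:
        raise ValueError("need at least one page and a size ratio of 2 or more")
    level = 1
    while num_pages >= T ** level:
        level += 1
    return level


# With T = 2, merging two 1-page runs yields a 2-page run, which belongs to level 2.
print(level_of(1, 2), level_of(2, 2), level_of(4, 2))
```

With a larger T, more flushed runs are absorbed by a level before the merged result is promoted to the next one.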

During a merge, if two sorted runs contain entries with the 
same block id, the rule in Algorithm 7 is applied. 
We assume that entry1 is from the more recently created 
run. The erase flag is used to discard any entries from before 
the last time this block was erased. Otherwise, the bitwise 
or operation is used to merge the bitmaps. 

Algorithm: merge_entries 
Input: entry1, entry2 

    if entry1.erase_flag == true then 
        return entry1; 
    else 
        entry1.bitmap = entry1.bitmap or entry2.bitmap; 
        entry1.erase_flag = entry2.erase_flag; 
        return entry1; 
    end 

Algorithm 7: merge_entries 
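
The rule of Algorithm 7 can be embedded in a two-way merge of sorted runs, sketched below in Python. The representation of a run as a list of (block id, entry) pairs sorted by block id is an assumption for this example, as are the function names.

```python
from dataclasses import dataclass


@dataclass
class GeckoEntry:
    bitmap: int = 0           # bit i set => page i of the block is invalid
    erase_flag: bool = False  # set when the block has been erased


def merge_entries(newer: GeckoEntry, older: GeckoEntry) -> GeckoEntry:
    if newer.erase_flag:
        # The block was erased after the older entry was written: drop the older one.
        return GeckoEntry(newer.bitmap, True)
    return GeckoEntry(newer.bitmap | older.bitmap, older.erase_flag)


def merge_runs(newer_run: list, older_run: list) -> list:
    """Merge two runs, each a list of (block_id, GeckoEntry) sorted by block_id."""
    out, i, j = [], 0, 0
    while i < len(newer_run) and j < len(older_run):
        (nb, ne), (ob, oe) = newer_run[i], older_run[j]
        if nb == ob:
            out.append((nb, merge_entries(ne, oe)))
            i, j = i + 1, j + 1
        elif nb < ob:
            out.append(newer_run[i])
            i += 1
        else:
            out.append(older_run[j])
            j += 1
    return out + newer_run[i:] + older_run[j:]


newer = [(1, GeckoEntry(0b01)), (3, GeckoEntry(0b10, True))]
older = [(1, GeckoEntry(0b10)), (2, GeckoEntry(0b11)), (3, GeckoEntry(0b01))]
print(merge_runs(newer, older))
```

In the example, block 1 ends up with the union of both bitmaps, whereas for block 3 the older bitmap is discarded because the newer entry carries the erase flag.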

We maintain a RAM-based data structure called the 
Logarithmic Gecko Mapping Directory (LGMD), which keeps 
track of the whereabouts of all flash pages belonging to the 
LSM-tree. For each page, we also store the value of the 
first key within the page. 

6.3 Live-Page-Identification 


We can use the LSM-tree to perform live-page-identification 
as follows. When a victim block has been selected, we search 
the sorted runs one by one starting at the lowest level, either 
until we have searched all of them, or until we encounter 
an entry whose erase flag is set to true. Searching a level 
requires exactly 1 IO because a run is sorted and we can use 
LGMD to infer the only candidate page in the run that can 
store the key, since LGMD stores the starting key of each 
page in the run. 

Once we have finished the search, we perform a bitwise or 
operation on all of the bitmaps we found. We then invoke 
algorithm [3] to resolve any remaining false positives in the 
resulting bitmap. This gives us an up-to-date image of which 
pages are currently live in the block. 
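
The search described above can be sketched as follows. This is an illustration under simplifying assumptions: each run is modelled as a Python dictionary (standing in for the single page read located through LGMD), runs are ordered from the lowest level upwards, and the subsequent resolution of false positives through the Reverse Map is omitted.

```python
def invalid_bitmap(runs: list, block_id: int) -> int:
    """OR together the bitmaps found for block_id, newest run first.

    Each run maps block_id -> (bitmap, erase_flag). The search stops after
    the first entry whose erase flag is set, since older entries predate the erase.
    """
    result = 0
    for run in runs:
        entry = run.get(block_id)
        if entry is None:
            continue
        bitmap, erase_flag = entry
        result |= bitmap
        if erase_flag:
            break
    return result


def candidate_live_pages(runs: list, block_id: int, pages_per_block: int) -> list:
    """Page offsets not marked invalid (false positives are not yet resolved)."""
    invalid = invalid_bitmap(runs, block_id)
    return [i for i in range(pages_per_block) if not (invalid >> i) & 1]


runs = [
    {5: (0b0001, False)},                     # level 0 (buffer)
    {5: (0b0100, True), 6: (0b0001, False)},  # level 1: block 5 was erased here
    {5: (0b1000, False)},                     # level 2: stale, predates the erase
]
print(candidate_live_pages(runs, 5, 4))
```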

6.4 Victim-Selection 

Recall from Section U that in our system there is a pool 
of free blocks. When this pool is nearly empty, the garbage- 
collection mechanism is invoked. We now discuss how this 
mechanism performs victim-selection. 

There are four types of blocks in Logarithmic Gecko: 
translation blocks (storing the conventional logical to physical 
page mapping), reverse blocks (storing the reverse page 
mapping), Gecko blocks (storing the LSM-tree), and data 
blocks (storing user data). For convenience, we refer to 
translation, reverse and Gecko blocks collectively as 
internal blocks. Note that internal blocks occupy less than 1% 
of all blocks on the system. 

For the purpose of victim-selection, Logarithmic Gecko 
treats data blocks and internal blocks differently. For inter¬ 
nal blocks, page validity bitmaps are stored in RAM. They 
are maintained lazily using Lazy Gecko. 

For data blocks, we apply the LRU policy using a simple 
queue, called the Data Block Queue (DBQ). When a block 
is erased and rewritten, its id is inserted into the queue. For 
victim-selection, we pop the head of the queue and search the 
Logarithmic Gecko LSM-tree with the block id as the key, as 
in Section 6.3. This gives us the page validity bitmap for the 
candidate block, and we cache it in RAM. 

The greedy strategy is ultimately applied to pick the block 
with the fewest live pages from among the internal 
blocks and the cached data block candidate. Note that if 
the cached data block candidate contains cold data, it may 
never be picked. We therefore apply a rule whereby the candidate 
is discarded if it is not picked after 3 victim-selection 
processes. Note that if the cached data block is selected, we 
immediately start searching for the next data block 
candidate (by popping the DBQ and looking up the block id in 
the LSM-tree). 
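
The victim-selection policy above can be sketched as follows. This is a simplified illustration, not the actual implementation: the class and parameter names are ours, the LSM-tree search is abstracted as a callable returning a live-page count, and internal and data block ids are assumed to be distinct.

```python
from collections import deque

MAX_ROUNDS = 3  # candidate is dropped after this many unsuccessful selections


class VictimSelector:
    def __init__(self, dbq, lookup_live_count, internal_live_counts):
        self.dbq = deque(dbq)                 # least recently written ids at the left
        self.lookup = lookup_live_count       # stand-in for the LSM-tree search
        self.internal = internal_live_counts  # block_id -> live pages (RAM bitmaps)
        self.candidate = None                 # cached (block_id, live_count)
        self.rounds = 0

    def _refresh_candidate(self):
        self.candidate, self.rounds = None, 0
        if self.dbq:
            block_id = self.dbq.popleft()
            self.candidate = (block_id, self.lookup(block_id))

    def select(self):
        if self.candidate is None:
            self._refresh_candidate()
        options = list(self.internal.items())
        if self.candidate is not None:
            options.append(self.candidate)
        if not options:
            return None
        victim = min(options, key=lambda o: o[1])  # greedy: fewest live pages
        if victim is self.candidate:
            self._refresh_candidate()              # fetch the next data candidate
        else:
            self.internal.pop(victim[0])           # the internal block gets erased
            self.rounds += 1
            if self.rounds >= MAX_ROUNDS:
                self._refresh_candidate()          # discard a likely-cold candidate
        return victim[0]


live = {1: 3, 2: 0}
selector = VictimSelector([1, 2], live.__getitem__, {100: 5, 101: 1})
print([selector.select() for _ in range(4)])
```

Only one LSM-tree search is needed per data block candidate, since the resulting count is cached until the candidate is either picked or discarded.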

6.5 Further Possible Optimisations 

6.5.1 Compression 

We can reduce the IO overhead of merging the tree by 
compressing the bitmaps. Our key insight here is that when 
the LSM-tree buffer is flushed, the vast majority of the 
entries within it have only 1 bit set to 1. 
This is because the number of entries that fit into the buffer is 
vastly smaller than the number of blocks in the system. We 
can exploit this by not storing a full bitmap (e.g. 16 bytes 
for a 128-page block), but only storing the offsets of the 
pages that are invalid for the first few levels of the tree (e.g. 
4 bytes). This allows us to store four times more entries in 
the buffer before we need to flush it, thereby significantly 
reducing the number of LSM-tree flushes. 
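
One possible encoding along these lines is sketched below. It is only an illustration: the one-byte-per-offset format, the threshold of 4 offsets, and the function names are assumptions, and the format only works for blocks of at most 256 pages.

```python
def encode_entry(bitmap: int, pages_per_block: int, max_offsets: int = 4):
    """Store sparse bitmaps as a short list of invalid-page offsets."""
    offsets = [i for i in range(pages_per_block) if (bitmap >> i) & 1]
    if len(offsets) <= max_offsets:
        return "offsets", bytes(offsets)  # one byte per offset
    size = (pages_per_block + 7) // 8
    return "bitmap", bitmap.to_bytes(size, "little")


def decode_entry(kind: str, payload: bytes) -> int:
    """Recover the bitmap from either representation."""
    if kind == "offsets":
        bitmap = 0
        for offset in payload:
            bitmap |= 1 << offset
        return bitmap
    return int.from_bytes(payload, "little")


kind, payload = encode_entry(1 << 5, 128)  # a single invalid page at offset 5
print(kind, len(payload), "byte(s) instead of", 128 // 8)
```

Dense bitmaps, which are more likely at the higher levels of the tree, fall back to the full 16-byte representation.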

6.5.2 Multi-way Merge 

As mentioned in Section 6.2, the merging of adjacent levels 
may continue recursively, as long as the result of one merge 
leads to the existence of more than one run in the next 
bigger level. Note that this is wasteful in terms of IOs, as we 
merge the same data from the lower runs several 
times. We can reduce the number of IOs by proactively 
determining how many levels the recursive merge will 
encompass, and instead performing a single multi-way sort-merge. 
A run at level i participates in 
a merge if: (1) it is not already participating in another 
merge, (2) there is at least one run at level i - 1 participating 
in this merge, and (3) all the runs at level i - 1 which are 
participating in the merge have a combined size of at least 
$s = T^{i} - T^{i-1}$. These rules are simple. The only downside 
is that more input buffers are needed in RAM to perform 
the multi-way merge. If L is the number of levels in the tree, 
then we need at most L buffers. 
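
A multi-way merge of this kind can be sketched with the standard library's `heapq.merge`. The sketch assumes that the participating runs have already been chosen and are passed newest first, each as a list of (block id, bitmap, erase flag) tuples sorted by block id; the participation criteria themselves are not modelled.

```python
import heapq
from itertools import groupby


def multiway_merge(runs: list) -> list:
    """Merge several sorted runs in one pass; runs[0] is the most recent."""
    # Tag each entry with the age of its run so that, for equal block ids,
    # entries are visited from newest to oldest.
    tagged = (
        [(block_id, age, bitmap, flag) for block_id, bitmap, flag in run]
        for age, run in enumerate(runs)
    )
    merged = heapq.merge(*tagged)
    out = []
    for block_id, group in groupby(merged, key=lambda t: t[0]):
        bitmap, erased = 0, False
        for _, _, entry_bitmap, entry_flag in group:
            bitmap |= entry_bitmap
            if entry_flag:
                erased = True  # older entries predate the erase: skip them
                break
        out.append((block_id, bitmap, erased))
    return out


runs = [
    [(1, 0b001, False)],
    [(1, 0b010, True), (2, 0b001, False)],
    [(1, 0b100, False), (3, 0b001, False)],
]
print(multiway_merge(runs))
```

Each input run is read exactly once, which is the source of the IO saving compared to repeated pairwise merges.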

6.5.3 Flash-Resident Queue 

Finally, the DBQ may occupy a substantial amount of 
RAM relative to the other RAM-resident data structures 
in Logarithmic Gecko. For instance, for a device with the 
dimensions of the Micron P420m in table [T], the DBQ takes up 
1 MB of RAM. Luckily, it is very easy to store most of this 
queue in flash. We use an input buffer into which block ids are 
appended when blocks are erased. When it runs out of free 
space, it is flushed to flash. A RAM-based structure called 
the Queue Directory keeps track of which flash pages belong 
to the DBQ. There is also an output buffer which contains the 
block ids that were least recently written. Block ids can be 
popped from this output buffer as candidates for 
garbage-collection. When the output buffer runs out of entries, 
we use the Queue Directory to read the next queue page from 
flash. The IO cost of this technique is negligible, as only 
1 read IO and 1 write IO are needed for every P/a blocks 
rewritten (4096 for the Micron P420m). 
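
A flash-resident queue of this kind might be sketched as follows. The class is hypothetical: flash is simulated by an in-memory deque of pages (whose order plays the role of the Queue Directory), and the page capacity is kept tiny for illustration.

```python
from collections import deque


class FlashQueue:
    def __init__(self, ids_per_page: int):
        self.ids_per_page = ids_per_page
        self.input = []             # most recently rewritten block ids
        self.flash_pages = deque()  # simulated flash pages, oldest first
        self.output = deque()       # least recently rewritten block ids
        self.reads = self.writes = 0

    def push(self, block_id: int) -> None:
        self.input.append(block_id)
        if len(self.input) == self.ids_per_page:
            self.flash_pages.append(self.input)  # one flash write per full page
            self.writes += 1
            self.input = []

    def pop(self):
        if not self.output:
            if self.flash_pages:
                self.output.extend(self.flash_pages.popleft())  # one flash read
                self.reads += 1
            elif self.input:
                self.output.extend(self.input)  # queue is short: stay in RAM
                self.input = []
            else:
                return None
        return self.output.popleft()


queue = FlashQueue(ids_per_page=2)
for block_id in [1, 2, 3, 4, 5]:
    queue.push(block_id)
print([queue.pop() for _ in range(5)], queue.reads, queue.writes)
```

RAM usage is bounded by the two buffers regardless of how many block ids the queue holds.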

7. ANALYSIS 

Table 2 shows a breakdown of the minimum amount of 
RAM needed by the different RAM-based data structures 
in Lazy and Logarithmic Gecko for our two example flash 
devices. The formulas and figures are derived using the 
terms and values in table [T]. The GMT is not listed, because 
it does not require a strict minimum amount of RAM. It is 
assumed that any leftover RAM in the system is allocated 
to the GMT. 

Table 2 shows that Lazy Gecko's RAM consumption for 
both devices is on the order of megabytes. The 
dominant RAM occupant is the PVB. The other data structures, 
the GMD and the RMD, which store the whereabouts of 
translation pages for the global mapping table and the 
reverse mapping table, are common to Lazy Gecko and 
Logarithmic Gecko. 

Logarithmic Gecko requires significantly less RAM than 
Lazy Gecko because it stores the PVB in flash. However, 
Logarithmic Gecko stores several other RAM-based data 
structures to support the flash-based LSM-tree and queue. 
The Logarithmic Gecko Mapping Directory (Section 6.2) and 
the Queue Directory (Section 6.5.3) keep track of the 
whereabouts of all flash pages belonging to 
the LSM-tree or to the data block queue. These mappings 
are small because the number of flash pages they keep track 
of is relatively small. As we saw in Section 6.4, Logarithmic 
Gecko uses several cached bitmaps for blocks that host 
translation pages and LSM-tree pages, but since there are few 
such blocks, these bitmaps consume little RAM. We omit 
the calculation for the RAM-consumption of these bitmaps 
from the table because it is cumbersome. Finally, 
Logarithmic Gecko uses significantly more page buffers than Lazy 
Gecko to support the multi-way merge (L+1 buffers), the 
data block queue (one input and one output buffer), 
and the LSM-tree (one input buffer). 

Interestingly, the relative amount of RAM saved is much 
higher for the larger Micron device. The reason is that the 
larger device has far more flash pages. The PVB grows in 
proportion to the number of pages, but all the other 
structures grow at a slower rate. Thus, the more pages 
an SSD has, the greater the relative RAM saving we get 
from Logarithmic Gecko. 

Is the magnitude of RAM-saving by Logarithmic Gecko 
significant? To answer this, let us compare it to the 
magnitude of RAM-saving that the original DFTL enabled. 
Consider the minimal amount of RAM needed to store a pure 
RAM-based mapping where all mapping entries are in RAM: 
$y = K \cdot B \cdot 4$ bytes, assuming 4 bytes per entry. Compare 
this to DFTL under Lazy Gecko, where we need at least 
$x = (K \cdot B)/8$ bytes. DFTL under Lazy Gecko thus 
reduces the RAM footprint to as little as $x/y = 1/32$ of the 
original, a 97% improvement. Logarithmic Gecko is capable of 
reducing this RAM footprint by a further 97% on top. Thus, the 
magnitude of the reduction in RAM-consumption of moving 
from Lazy Gecko to Logarithmic Gecko is equivalent to the 
magnitude in RAM-reduction that DFTL enabled in the 
first place. 
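
Spelled out, taking $x$ as the size of the bitmap in bytes (one bit for each of the $K \cdot B$ pages) and $y$ as the size of a full RAM-based mapping (4 bytes per page), the ratio works out as follows:

```latex
\[
\frac{x}{y} \;=\; \frac{(K \cdot B)/8}{K \cdot B \cdot 4} \;=\; \frac{1}{32} \;\approx\; 3.1\%,
\qquad
1 - \frac{x}{y} \;=\; \frac{31}{32} \;\approx\; 96.9\% .
\]
```

The ratio is independent of $K$ and $B$, so the 97% figure holds for any device size as long as mapping entries occupy 4 bytes.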

In exchange for the lower RAM requirements, Logarithmic 
Gecko introduces some IO overheads relative to Lazy Gecko, 
which we capture in Table 3. Let us start with the cost of an 
application write, which involves an insertion into the 
LSM-tree buffer. It is known that the cost of an insertion into an 
LSM-tree is $O\left(\frac{T}{D} \log_T \frac{N}{D}\right)$, where N is the number of 
entries in the tree, D is the number of entries that fit into one 
page, and T is the size ratio between adjacent levels in the 
tree. In the case of Logarithmic Gecko's LSM-tree, 
N is equal to the number of blocks in the system K, and D is 
roughly equal to P/B, the number of block bitmaps that fit 
into one flash page. Thus, the cost of an application write 
in Logarithmic Gecko is $O\left(\frac{T \cdot B}{P} \log_T \frac{K \cdot B}{P}\right)$. Note that this expression 
is much lower than 1. In our experiments, the contribution 
of the LSM-tree to write-amplification was not greater than 
3%. 
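
To get a feeling for the magnitudes involved, the expressions above can be evaluated numerically. The sketch below is purely illustrative: the values of T, N and D are assumptions rather than measured parameters, and the asymptotic bounds hide constant factors.

```python
import math


def lsm_insert_cost(T: int, N: int, D: int) -> float:
    """Order-of-magnitude amortised IOs per insertion: (T / D) * log_T(N / D)."""
    return (T / D) * math.log(N / D, T)


def lsm_levels(T: int, N: int, D: int) -> int:
    """Number of levels needed to hold N entries with D entries per page."""
    return math.ceil(math.log(N / D, T))


# Assumed values (hypothetical): 2**18 blocks, 241 entries per page, size ratio 2.
T, N, D = 2, 2 ** 18, 241
print(round(lsm_insert_cost(T, N, D), 3), "IOs per insertion")
print(lsm_levels(T, N, D), "levels")
```

Under these assumptions the amortised cost per insertion is far below one IO, while the number of levels bounds the number of reads needed to look up a block.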

Let us now consider the additional overheads introduced by 
the two schemes during garbage-collection. Both 
Logarithmic Gecko and Lazy Gecko involve a cost of 1 flash 
read and 1 flash write per garbage-collection operation for 
reading and rewriting a page from the reverse map. 
Logarithmic Gecko is associated with an additional cost 
for searching the LSM-tree to reconstruct the page 
validity bitmap for a candidate data block. In the worst case, 
each level of the tree must be searched, and searching each 
level involves at most 1 IO. Thus, the worst case cost is the 
number of levels in the tree: $\log_T(N/D)$. 




Table 2: RAM-resident data structures for DFTL with Lazy Gecko and Logarithmic Gecko 

Scheme                        Data structure               Size (bytes)                  Micron P420m   Intel 525 series 
----------------------------  ---------------------------  ----------------------------  -------------  ---------------- 
DFTL with Lazy Gecko          Global Mapping Directory     a·(LBA/(P/a))                 90 KB          22 KB 
                              Reverse Mapping Directory    a·(PBA/(P/a))                 128 KB         32 KB 
                              Page Validity Bitmap                                       16 MB          1 MB 
                              total                                                      ≈ 16.5 MB      ≈ 1.1 MB 

DFTL with Logarithmic Gecko   Global Mapping Directory     a·(LBA/(P/a))                 90 KB          22 KB 
                              Reverse Mapping Directory    a·(PBA/(P/a))                 128 KB         32 KB 
                              Gecko Mapping Directory      2a·(2·(K/(P/(a+(B/8)))))      8.5 KB         2.5 KB 
                              Queue Directory              2a·((K·a)/P)                  512 B          512 B 
                              Cached bitmaps                                             15 KB          4 KB 
                              Page buffers                 P·(4 + L)                     241 KB         53 KB 
                              total                                                      ≈ 482 KB       ≈ 112 KB 

Ratio                         total logarithmic / total lazy                             ≈ 3%           ≈ 11% 

Table 3: Comparison of overheads for different garbage-collection techniques 

                     Overheads for a write IO                Overheads for a GC operation 
Technique            flash reads   flash writes              flash reads              flash writes 
-------------------  ------------  ------------------------  -----------------------  ------------ 
Lazy Gecko           0             0                         1                        1 
Logarithmic Gecko    0             O((T/D) · log_T(N/D))     1 + O(log_T(N/D))        1 



Figure 1: Impact on write-amplification for different 
GC schemes as we decrease RAM 


8. EVALUATION 

We used the SSD simulator EagleTree [5] to simulate an 
SSD similar to the Intel 525 series. It has the same features 
as in Table [T]. Over-provisioning (OP) was set to 30%. In our 
experiment, we varied the amount of RAM given to the SSD. We 
experimented with Lazy Gecko, Logarithmic Gecko, and a 
theoretical implementation of Lazy Gecko where all available 
RAM is used on the GMT. The workload we used consisted 
of uniformly randomly distributed writes across the logical 
address space. In Figure 1, we see how the different schemes 
allow the simulated SSD to scale. We start with an amount 
of RAM sufficient to store the entire page mapping in GMT, 
and halve it each time. We measure write-amplification. 

The theoretical implementation of Lazy Gecko increases in 
write-amplification as we decrease RAM due to more 
evictions from GMT, but eventually levels off as we approach 
the theoretical worst case where each page write involves 1 
eviction, so write-amplification is essentially doubled. Lazy 
Gecko performs as well as the theoretically optimal 
implementation, but it cannot scale as the amount of RAM decreases. 
Logarithmic Gecko is able to scale to far lower levels of RAM 
than Lazy Gecko. Note that a bigger SSD such as the 
Micron P420m would be able to scale to even lower levels of 
RAM relative to Lazy Gecko, as we saw in Table 2. The 
dotted blue line shows the contribution of the LSM-tree writes 
and reverse map writes to write-amplification for Logarithmic 
Gecko, which is relatively low. Garbage-collection and 
evictions from GMT constitute the bulk of write-amplification. 

We also examined the impact of Logarithmic Gecko on 
read-amplification, or the number of internal reads that take 
place for each application write. The factors that contribute 
to read-amplification are (1) garbage-collection reads, (2) 
LSM-tree lookups and (3) mapping reads (both direct and 
reverse maps). Of these 3 factors, the contribution of the 
LSM-tree lookups to read-amplification in our experiment ranges 
from 12% when RAM is plentiful down to 3% when RAM 
is scarce. The point is that the overheads of Logarithmic 
Gecko are just a small fraction of the overheads we would 
have had to pay anyway due to mapping reads and 
garbage-collection reads. 

9. CONCLUSION 

We introduced the problem of maintaining the metadata 
needed to perform victim-selection and live-page-identification 
for garbage-collection in the context of flash-resident 
page-associative schemes. Lazy Gecko was introduced to solve 
this problem. It entails a modest IO overhead, and it 
requires storing a relatively large bitmap in RAM. We showed 
that this bitmap may introduce a scalability issue, and 
introduced Logarithmic Gecko, which stores this bitmap in flash 
as an LSM-tree. Logarithmic Gecko is able to scale to far 
lower quantities of RAM relative to the size of the SSD, and 
it only introduces a modest IO overhead due to maintaining 
and querying the LSM-tree. 